Project Description

The gym chain Model Fitness is developing a customer interaction strategy based on analytical data.One of the most common problems gyms and other services face is customer churn. How do you know if a customer is no longer with you? You can calculate churn based on people who get rid of their accounts or don't renew their contracts. However, sometimes it's not obvious that a client has left: they may walk out on tiptoes. Churn indicators vary from field to field. If a user buys from an online store rarely but regularly, you can't say they're a runaway. But if for two weeks they haven't opened a channel that's updated daily, that's a reason to worry: your follower might have gotten bored and left you.

For a gym, it makes sense to say a customer has left if they don't come for a month. Of course, it's possible they're in Cancun and will resume their visits when they return, but's that's not a typical case. Usually, if a customer joins, comes a few times, then disappears, they're unlikely to come back.

In order to fight churn, Model Fitness has digitized a number of its customer profiles. Your task is to analyze them and come up with a customer retention strategy.

We will use data from gym_churn_us.csv

Step 1. Download the data

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
import plotly.express as px
from sklearn.cluster import KMeans
from plotly.subplots import make_subplots
import math
import warnings
warnings.filterwarnings('ignore')
from plotly import graph_objects as go
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import silhouette_score

In [2]:
data = pd.read_csv('/datasets/gym_churn_us.csv')
data.sample()
Out[2]:
gender Near_Location Partner Promo_friends Phone Contract_period Group_visits Age Avg_additional_charges_total Month_to_end_contract Lifetime Avg_class_frequency_total Avg_class_frequency_current_month Churn
148 0 0 1 0 1 6 0 30 208.806185 6.0 3 3.041605 3.110077 0
In [3]:
data.columns = map(str.lower, data.columns)
data.head()
Out[3]:
gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn
0 1 1 1 1 0 6 1 29 14.227470 5.0 3 0.020398 0.000000 0
1 0 1 0 0 1 12 1 31 113.202938 12.0 7 1.922936 1.910244 0
2 0 1 1 0 1 1 0 28 129.448479 1.0 2 1.859098 1.736502 0
3 0 1 1 1 1 12 1 33 62.669863 12.0 2 3.205633 3.357215 0
4 1 1 1 1 1 1 0 26 198.362265 1.0 3 1.113884 1.120078 0

Step 2. Carry out exploratory data analysis (EDA)

  • Look at the dataset: does it contain any missing features? Study the mean values and standard deviation (use the describe() method).
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
gender                               4000 non-null int64
near_location                        4000 non-null int64
partner                              4000 non-null int64
promo_friends                        4000 non-null int64
phone                                4000 non-null int64
contract_period                      4000 non-null int64
group_visits                         4000 non-null int64
age                                  4000 non-null int64
avg_additional_charges_total         4000 non-null float64
month_to_end_contract                4000 non-null float64
lifetime                             4000 non-null int64
avg_class_frequency_total            4000 non-null float64
avg_class_frequency_current_month    4000 non-null float64
churn                                4000 non-null int64
dtypes: float64(4), int64(10)
memory usage: 437.6 KB
In [5]:
data.describe()
Out[5]:
gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 0.510250 0.845250 0.486750 0.308500 0.903500 4.681250 0.412250 29.184250 146.943728 4.322750 3.724750 1.879020 1.767052 0.265250
std 0.499957 0.361711 0.499887 0.461932 0.295313 4.549706 0.492301 3.258367 96.355602 4.191297 3.749267 0.972245 1.052906 0.441521
min 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 18.000000 0.148205 1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 27.000000 68.868830 1.000000 1.000000 1.180875 0.963003 0.000000
50% 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 29.000000 136.220159 1.000000 3.000000 1.832768 1.719574 0.000000
75% 1.000000 1.000000 1.000000 1.000000 1.000000 6.000000 1.000000 31.000000 210.949625 6.000000 5.000000 2.536078 2.510336 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 12.000000 1.000000 41.000000 552.590740 12.000000 31.000000 6.023668 6.146783 1.000000

In [6]:
print('Duplicates amount in the dataframe:', data.duplicated().sum())
Duplicates amount in the dataframe: 0
In [7]:
data.isna().sum()
Out[7]:
gender                               0
near_location                        0
partner                              0
promo_friends                        0
phone                                0
contract_period                      0
group_visits                         0
age                                  0
avg_additional_charges_total         0
month_to_end_contract                0
lifetime                             0
avg_class_frequency_total            0
avg_class_frequency_current_month    0
churn                                0
dtype: int64

In [8]:
data.isnull().sum()
Out[8]:
gender                               0
near_location                        0
partner                              0
promo_friends                        0
phone                                0
contract_period                      0
group_visits                         0
age                                  0
avg_additional_charges_total         0
month_to_end_contract                0
lifetime                             0
avg_class_frequency_total            0
avg_class_frequency_current_month    0
churn                                0
dtype: int64

There are neither missing nor duplicated values

  • Look at the mean feature values in two groups: for those who left (churn) and for those who stayed (use the groupby() method).
In [10]:
data1 = data.groupby('churn').mean().reset_index()
data1
Out[10]:
churn gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month
0 0 0.510037 0.873086 0.534195 0.353522 0.903709 5.747193 0.464103 29.976523 158.445715 5.283089 4.711807 2.024876 2.027882
1 1 0.510839 0.768143 0.355325 0.183789 0.902922 1.728558 0.268615 26.989632 115.082899 1.662582 0.990575 1.474995 1.044546

  • Plot bar histograms and feature distributions for those who left (churn) and those who stayed.
In [11]:
subplot_titles=('gender', 'near_location', 'partner', 'promo_friends','phone', 'contract_period',
                'group_visits', 'age','additional_charges_total', 'month_to_end_contract',
                'lifetime', 'class_freq_total','class_freq_current_month')

fig = make_subplots(rows=7, cols=2,
                   subplot_titles=subplot_titles,vertical_spacing=0.04)

idx = 0
r = (math.floor(idx/2) + 1)
c = (idx%2 + 1)
legend = True

for i in data.columns.values[0:13]:
    fig.add_trace(go.Histogram(x=data.query('churn == "1"')[i],
                               name='churn=yes', legendgroup='churn',
                               marker = {'color':'DodgerBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data.query('churn == "0"')[i],
                               name='churn=no', legendgroup='nochurn',
                               marker = {'color':'DarkTurquoise'},
                               showlegend=legend),
                  row=r, col=c)
    idx = idx+1
    r = (math.floor(idx/2) + 1)
    c = (idx%2 + 1)
    legend = False

fig.update_xaxes(type="category", row=3, col=2)
fig.update_layout(height=1700)
fig.show()

EDA conclusions:

can be spotted that the staying group (not churn) has highier means at most of the features such as contract periode, group visits and partner. Also they spent more money on additional services, renewed their subscription and visit more often. The age and gender parameters have less influence on the churn. In addition, can be seen that age distributes normaly, class_freq_current_month wich stays very close to normal distribution *

  • Build a correlation matrix and display it.
In [12]:
corr_data = data.corr()
In [13]:
plt.figure(figsize = (15,15))
plt.title('Correlation Matrix of Features', fontsize=18)
sns.heatmap(corr_data, square=True, annot=True)
plt.show()

There are strong correlations between:

  • average frequency of visits per week over the preceding month and average frequency of visits per week over the customer's lifetime
  • contract period and the months remaining until the contract expires

Step 3. Build a model to predict user churn

Build a binary classification model for customers where the target feature is the user's leaving next month. Divide the data into train and validation sets using the train_test_split() function. Train the model on the train set with two methods:

  • logistic regression
  • random forest
In [14]:
X = data.drop('month_to_end_contract', axis = 1)
y = data['month_to_end_contract']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [15]:
# define the algorithm for the logistic regression model
lr_model = LogisticRegression()
In [16]:
# train the model
lr_model.fit(X_train, y_train)
# use the trained model to make predictions
lr_predictions = lr_model.predict(X_test)
lr_probabilities = lr_model.predict_proba(X_test)[:,1]
In [17]:
# define the algorithm for the new random forest model
rf_model = RandomForestClassifier(n_estimators = 100) 
# train the random forest model
rf_model.fit(X_train, y_train)
# use the trained model to make predictions
rf_predictions = rf_model.predict(X_test)
rf_probabilities = rf_model.predict_proba(X_test)[:,1] 

  • Evaluate accuracy, precision, and recall for both models using the validation data. Use them to compare the models. Which model gave better results?
In [18]:
print('Metrics for logistic regression:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, lr_predictions)))
print('Precision: {:.2f}'.format(precision_score(y_test, lr_predictions,average='weighted')))
print('Recall: {:.2f}'.format(recall_score(y_test, lr_predictions,average='weighted')))
print('F1: {:.2f}'.format(f1_score(y_test, lr_predictions,average='weighted')))
print('Metrics for random forest:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, rf_predictions)))
print('Precision: {:.2f}'.format(precision_score(y_test, rf_predictions,average='weighted')))
print('Recall: {:.2f}'.format(recall_score(y_test, rf_predictions,average='weighted')))
print('F1: {:.2f}'.format(f1_score(y_test, rf_predictions,average='weighted')))
Metrics for logistic regression:
Accuracy: 0.85
Precision: 0.75
Recall: 0.85
F1: 0.79
Metrics for random forest:
Accuracy: 0.83
Precision: 0.76
Recall: 0.83
F1: 0.78

Accuracy, Recall and F1 metrics are highier in logistic regression model

Step 4. Create user clusters

Set aside the column with data on churn and identify object (user) clusters:

  • Standardize the data.
  • Use the linkage() function to build a matrix of distances based on the standardized feature matrix and plot a dendrogram. Use the resulting graph to estimate the number of clusters you can single out.
In [19]:
sc = StandardScaler()
x_sc = sc.fit_transform(X)
linked = linkage(x_sc, method = 'ward')
In [20]:
sc = StandardScaler()
x_sc = sc.fit_transform(X)
linked = linkage(x_sc, method = 'ward')
plt.figure(figsize=(15, 10))  
dendrogram(linked, orientation='top')
plt.title('Hierarchical clustering for GYM')
plt.show()

By looking the resulting graph we can estimate that there are 5 clusters that we can single out.

  • Train the clustering model with the K-means algorithm and predict customer clusters. (Let the number of clusters be n=5, so that it'll be easier to compare your results with those of other students. However, in real life, no one will give you such hints, so you'll have to decide based on the graph from the previous step.)
  • Look at the mean feature values for clusters. Does anything catch your eye?
In [21]:
# standardize the data
sc = StandardScaler()
x_sc = sc.fit_transform(X)
# define the k_means model with 5 clusters
km = KMeans(n_clusters = 5)
# predict the clusters for observations (the algorithm assigns them a number from 0 to 4)
labels = km.fit_predict(x_sc)
data_for_hists = data.copy()
In [22]:
# store cluster labels into the field of our dataset
#data_for_hists = data.copy()
data_for_hists['cluster_km'] = labels
# print the statistics of the mean feature values per cluster
data_for_hists.groupby(['cluster_km']).mean()
Out[22]:
gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn
cluster_km
0 0.526042 0.864583 0.473958 0.307292 0.000000 4.768229 0.427083 29.296875 144.096612 4.466146 3.932292 1.851384 1.720629 0.265625
1 0.501160 0.970998 0.914153 0.975638 1.000000 7.787703 0.563805 29.831787 157.767923 7.084687 4.661253 2.004259 2.001215 0.013921
2 0.551020 0.836735 0.326531 0.059076 0.998926 4.915145 0.440387 29.972073 159.240048 4.510204 4.712137 2.907233 2.912765 0.004296
3 0.508995 0.761905 0.342857 0.177778 0.998942 1.646561 0.256085 26.901587 115.450702 1.590476 0.975661 1.447700 1.023574 0.996825
4 0.470387 0.812073 0.397494 0.059226 1.000000 4.611617 0.395216 30.120729 158.419664 4.290433 4.626424 1.142102 1.142798 0.001139

We can spot that the highiest contract period is on cluster 4, also month to end contract is the highiest on the same cluster, it means that recently,the contracts renewed and they are also under high partner category. Also 4th cluster has a pretty high lifetime - they are repeat customers

  • Plot distributions of features for the clusters.
In [23]:
# binary features plots
groupmode = ['gender','near_location','partner','promo_friends','phone','contract_period',
             'group_visits','month_to_end_contract']

fig = make_subplots(rows=4, cols=2,
                   subplot_titles=groupmode,vertical_spacing=0.07)


idx = 0
r = (math.floor(idx/2) + 1)
c = (idx%2 + 1)
legend = True

for i in groupmode:
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 3')[i],
                               name='cluster3', legendgroup='cluster3',
                               marker = {'color':'MediumSlateBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 0')[i],
                               name='cluster0', legendgroup='cluster0',
                               marker = {'color':'DodgerBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 4')[i],
                               name='cluster4', legendgroup='cluster4',
                               marker = {'color':'DarkTurquoise'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 2')[i],
                               name='cluster2', legendgroup='cluster2',
                               marker = {'color':'PaleGreen'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 1')[i],
                               name='cluster1', legendgroup='cluster1',
                               marker = {'color':'LemonChiffon'},
                               showlegend=legend),
                  row=r, col=c)
    idx = idx+1
    r = (math.floor(idx/2) + 1)
    c = (idx%2 + 1)
    legend = False

fig.update_xaxes(type="category", row=3, col=2)
fig.update_layout(barmode='group', height=1200)
fig.show()

# continuous features plots
overlaymode = ['age','avg_additional_charges_total','lifetime','avg_class_frequency_total',
               'avg_class_frequency_current_month']

fig = make_subplots(rows=3, cols=2,
                   subplot_titles=overlaymode,vertical_spacing=0.07)

idx = 0
r = (math.floor(idx/2) + 1)
c = (idx%2 + 1)
legend = True

for i in overlaymode:
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 3')[i],
                               name='cluster3', legendgroup='cluster3',
                               marker = {'color':'MediumSlateBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 0')[i],
                               name='cluster0', legendgroup='cluster0',
                               marker = {'color':'DodgerBlue'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 4')[i],
                               name='cluster4', legendgroup='cluster4',
                               marker = {'color':'DarkTurquoise'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 2')[i],
                               name='cluster2', legendgroup='cluster2',
                               marker = {'color':'PaleGreen'},
                               showlegend=legend),
                  row=r, col=c)
    fig.add_trace(go.Histogram(x=data_for_hists.query('cluster_km == 1')[i],
                               name='cluster1', legendgroup='cluster1',
                               marker = {'color':'LemonChiffon'},
                               showlegend=legend),
                  row=r, col=c)
    idx = idx+1
    r = (math.floor(idx/2) + 1)
    c = (idx%2 + 1)
    legend = False

fig.update_layout(barmode='overlay', height=900)
fig.update_traces(opacity=0.75)
fig.show()

Can be seen that cluster 3 are the oldest by age and are also visiting the most. Cluster zero are the youngest and are visiting much less and also have the shortest lifetime

  • Calculate the churn rate for each cluster (use the groupby() method). Do they differ in terms of churn rate? Which clusters are prone to leaving, and which are loyal?
In [24]:
table = data_for_hists.groupby('cluster_km')['churn'].agg(count='count',sum='sum')
table['churn_rate'] = (table['sum']/table['count']*100).round(2)
table
Out[24]:
count sum churn_rate
cluster_km
0 384 102 26.56
1 862 12 1.39
2 931 4 0.43
3 945 942 99.68
4 878 1 0.11
In [25]:
churn_rate_data = data_for_hists.merge(table, on='cluster_km')
churn_rate_data.groupby(['cluster_km']).mean()
Out[25]:
gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn count sum churn_rate
cluster_km
0 0.526042 0.864583 0.473958 0.307292 0.000000 4.768229 0.427083 29.296875 144.096612 4.466146 3.932292 1.851384 1.720629 0.265625 384.0 102.0 26.56
1 0.501160 0.970998 0.914153 0.975638 1.000000 7.787703 0.563805 29.831787 157.767923 7.084687 4.661253 2.004259 2.001215 0.013921 862.0 12.0 1.39
2 0.551020 0.836735 0.326531 0.059076 0.998926 4.915145 0.440387 29.972073 159.240048 4.510204 4.712137 2.907233 2.912765 0.004296 931.0 4.0 0.43
3 0.508995 0.761905 0.342857 0.177778 0.998942 1.646561 0.256085 26.901587 115.450702 1.590476 0.975661 1.447700 1.023574 0.996825 945.0 942.0 99.68
4 0.470387 0.812073 0.397494 0.059226 1.000000 4.611617 0.395216 30.120729 158.419664 4.290433 4.626424 1.142102 1.142798 0.001139 878.0 1.0 0.11

The zero cluster has the highiest churn rate of 51.52%, cluster 2 follows him with 44.44% and cluster 1 with 26.68%. Clusters 3 and 4 have lowest churn rate of 7.24% and 2.76% respectively. So clusters 3 and 4 are the most loyal and 0, 2, 1 are prone to leaving

General conclusion

There are strong correlations between:

  • average frequency of visits per week over the preceding month and average frequency of visits per week over the customer's lifetime
  • contract period and the months remaining until the contract expires

Accuracy, Recall and F1 metrics are highier in logistic regression model

There are 5 clusters and highiest contract period is on cluster 4, also month to end contract is the highiest on the same cluster, it means that recently,the contracts renewed and they are also under high partner category. Also 4th cluster has a pretty high lifetime - they are repeat customers. Also can be seen that cluster 3 are the oldest by age and are also visiting the most. Cluster zero are the youngest and are visiting much less and also have the shortest lifetime. The zero cluster has the highiest churn rate of 51.52%, cluster 2 follows him with 44.44% and cluster 1 with 26.68%. Clusters 3 and 4 have lowest churn rate of 7.24% and 2.76% respectively. So clusters 3 and 4 are the most loyal and 0, 2, 1 are prone to leaving.